Abstract

We consider the problem of reconstructing graphs or labeled graphs from neighborhoods of a 
given radius r. Special instances of this problem include DNA shotgun assembly, neural network 
reconstruction, and assembling random jigsaw puzzles. We provide some necessary and some 
sufficient conditions for correct recovery both in combinatorial terms and for some generative 
models including random labelings of lattices, Erdos-Renyi random graphs, and the random 
jigsaw puzzle model. Many open problems and conjectures are provided. 


1 Introduction 

In this paper we study the problem of inferring a (labeled) graph from a collection of radius r 
(labeled) neighborhoods of the graph. In particular we ask how large r must be to ensure that a 
given randomly generated graph with labels can be uniquely identified (up to isomorphism) by its r- 
neighborhoods. Note that if the neighborhoods are too small then identifiability may be impossible; 
if r = 1 and all of the vertex labels are the same, then the graph is only identifiable from its 1- 
neighborhoods if the degree sequence determines a unique graph. As far as we know, graph shotgun
assembly for generative models has not been considered before at the level of generality considered
here. Some motivating examples include:

• DNA shotgun assembly; the goal is to reconstruct a DNA sequence from “shotgunned” 
stretches of the sequence. The theoretical version of this problem is graph shotgun assembly 
of a line graph with each vertex corresponding to a site in the genome, and so is labeled with 
an A, C, G, or T standing for the nucleotides making up DNA. The neighborhoods are strings 
of adjacent vertices of length r, which are referred to as “reads”. Shotgun assembly is one of 
the major techniques for reading DNA sequences and so the theoretical problem is already 
well understood. A main question is to determine how large r has to be to reconstruct the 
sequence with good probability under different models of vertex labeling; see, e.g., [Arratia
et al., 1996], [Dyer et al., 1994], and [Motahari et al., 2013] and references therein.

• Reconstructing neural networks; recent work in applied neuroscience identifies graph shotgun 
assembly as an important problem for reconstructing neural networks; the goal is to reconstruct
a big neural network from subnetworks that are observed in experiments [Soudry et al., 2013].

• The random jigsaw puzzle problem. Consider a jigsaw puzzle of size n × n where the
border between every two adjacent pieces is drawn uniformly at random using one of q shapes
of interfaces which we call “jigs.” How large should q be so that the puzzle can be recovered
uniquely? How can this be done efficiently? 


The problem considered here is most closely related to the famous reconstruction conjecture in 
combinatorics [Kelly, 1957], [Harary, 1974], which can be stated as follows: a graph G on at least 3
vertices is uniquely determined by the multi-set of all vertex-deleted subgraphs of G. Here a vertex 
deleted subgraph of G is a graph induced on all the vertices of G but one. In this paper we are 
interested in reconstructing (labeled) graphs from seemingly less information: given the graph we 
assume that we are given all (labeled) radius r neighborhoods in the graph. While the information 
is more localized, we make the additional assumption that either the graph structure or the labels 
are random. This makes the problem easier in comparison to the reconstruction conjecture. Indeed 
we show that for some popular random graph models reconstruction is possible from relatively
small neighborhoods. 

The graph shotgun problem is also related to the graph isomorphism problem [Babai et al., 1980].
It is a famous open problem to determine the complexity of graph isomorphism. In fact, one may 
consider a variant of the graph isomorphism problem in our setup: given the neighborhoods of two 
samples of a generative model, determine if the samples are identical, or are drawn independently. 
Part of the difficulty of the problem in this setup is that it may be required to determine if 
two neighborhoods are isomorphic or not. While we leave the question of graph isomorphism for 
randomly generated graphs for future work, we note that some of the techniques used for the 
classical graph isomorphism problem are related to our results. In particular our techniques for 
studying dense random graphs in Section 4.2 resemble some of the algorithms suggested for graph
isomorphism for some subclasses of graphs [Cai et al., 1992].

We also note that the question of whether an infinite graph is determined by some collection 
of its finite subgraphs has been studied in the context of unimodular and transitive infinite graphs 
[Aldous and Lyons, 2007], [Frisch and Tamuz, 2014].


1.1 General setup, models, main results 

A (deterministic or random) graph $\mathcal{G} = \mathcal{G}_N$ with $N$ vertices and labels (again possibly random)
from a finite set on each vertex or edge is given. Each vertex $v$ has a neighborhood $\mathcal{N}_r(v)$ of “radius”
$r$ which could be all of the vertices within distance $r$ or some variation (see the examples below); we
assume that the location of vertex $v$ is given in $\mathcal{N}_r(v)$.

Q1. (Identifiability) Given each of the $N$ neighborhoods $\mathcal{N}_r(v)$ for $v$ a vertex in the network, can
we correctly identify (up to isomorphism) the graph $\mathcal{G}$ and its labels? We view this question
as having two parts: (a) combinatorial criteria for identifiability (or non-identifiability), and
(b) the probability of identifiability under particular random generative models.

Q2. (Reconstruction) Assuming identifiability for a given $\mathcal{G}_N$ and $r$, for $0 < \varepsilon < 1$, what is
the minimum number, $M_{\mathrm{rec}}(N, r, \varepsilon)$, of samples (with replacement) from the collection of
neighborhoods that is necessary to ensure that the chance of correctly reconstructing the
network $\mathcal{G}$ with labels from the sample is at least $1 - \varepsilon$?
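To make the objects in Q1 and Q2 concrete, here is a minimal Python sketch of how a radius-$r$ neighborhood could be extracted from a labeled graph. The function names `ball` and `r_neighborhood` and the dictionary layout are assumptions made for illustration; they do not come from the paper. The sketch keeps the original vertex names, whereas in the assembly problem a neighborhood is only observed up to isomorphism (with its center marked).

```python
from collections import deque

def ball(adj, v, r):
    """Return the set of vertices within graph distance r of v (breadth-first search)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not explore beyond radius r
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def r_neighborhood(adj, labels, v, r):
    """Labeled subgraph induced by the radius-r ball around v, with v recorded as center."""
    verts = ball(adj, v, r)
    edges = {frozenset((a, b)) for a in verts for b in adj[a] if b in verts}
    return {"center": v, "labels": {u: labels[u] for u in verts}, "edges": edges}

# Example: a labeled path 0-1-2-3-4 (a toy "DNA" string A C G T A).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: "A", 1: "C", 2: "G", 3: "T", 4: "A"}
nbhd = r_neighborhood(adj, labels, 2, 1)
```

The input to the identifiability question is the collection of such neighborhoods over all vertices, with vertex names forgotten.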

Questions Q1(a) and Q2 are discussed in Section 2 where we derive general results about combinatorial
criteria for identifiability and upper and lower bounds on $M_{\mathrm{rec}}(N, r, \varepsilon)$ based on coupon
collecting. Notably our conditions for non-identifiability require that the graph is not isomorphic
to small perturbations of the graph obtained by replacing a neighborhood with a non-isomorphic
neighborhood (thus avoiding the difficulty of the reconstruction conjecture). In Sections 3, 4 and 5
Question Q1(b) is discussed and the general results of Section 2 are applied in the following three
examples. Let $d(u, w)$ denote the distance between two vertices in a graph.

1. $\mathcal{G}$ is the $d$ dimensional $n$-lattice, here denoted $\mathbb{Z}_n^d$, with i.i.d. vertex labels from a probability
distribution on $\{1, \ldots, q\}$ and the neighborhoods $\mathcal{N}_r(v)$ are the $(n - r - 1)^d$ $r$-cubes. Here
our neighborhoods differ slightly from the general setup and $N = N_{n,d,r} := (n - r - 1)^d$.

2. $\mathcal{G}$ is an Erdos-Renyi random graph with vertex set $V$ of size $N$ and edge probability $p_N$
where the vertices have no labels (or you can think of each having the same label) and the
$r$-neighborhoods, $\mathcal{N}_r(v), v \in V$, are the subgraphs induced by the vertices at distance no
greater than $r$ from each vertex. We also consider labeled generalizations of the model.

3. The random jigsaw puzzle problem. $\mathcal{G}$ is the $n \times n$ lattice and we view each vertex as being
the center of a puzzle piece with each of the four edges receiving one of $q$ jigs. Thus each
vertex is labeled with an ordered 4-tuple of the $q$ possible labels (jigs), corresponding to the
label of each edge. Note that adjacent vertices have dependent labels. The neighborhoods
$\mathcal{N}_0(v)$ are simply the vertices with labels and correspond to the puzzle pieces.
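The jigsaw model in the third example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, jigs are encoded as integers in `range(q)`, and edges on the outer border of the puzzle are also given a uniformly random jig (the text says each of the four edges of a piece receives a jig, but does not spell out an encoding).

```python
import random
from collections import Counter

def random_jigsaw(n, q, seed=0):
    """Sample an n x n random jigsaw: every lattice edge gets a uniform jig in range(q).

    Returns a dict mapping position (i, j) to the piece (top, right, bottom, left).
    Border edges also receive a jig (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # horiz[i][j]: jig on the horizontal edge above row i (rows 0..n), column j
    horiz = [[rng.randrange(q) for _ in range(n)] for _ in range(n + 1)]
    # vert[i][j]: jig on the vertical edge left of column j (columns 0..n), row i
    vert = [[rng.randrange(q) for _ in range(n + 1)] for _ in range(n)]
    pieces = {}
    for i in range(n):
        for j in range(n):
            pieces[(i, j)] = (horiz[i][j], vert[i][j + 1], horiz[i + 1][j], vert[i][j])
    return pieces

def count_duplicate_pieces(pieces):
    """Number of pieces whose 4-tuple of jigs also occurs on another piece."""
    counts = Counter(pieces.values())
    return sum(c for c in counts.values() if c > 1)

puzzle = random_jigsaw(4, 3, seed=1)
```

Adjacent pieces share the jig on their common edge, which is the dependence between neighboring labels noted in the text.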

The main question we address in these examples is: what conditions on $r$ or $q$ as $N \to \infty$
ensure identifiability (or non-identifiability)? We now summarize a subset of our findings and open
problems. 


Example 1: Lattices. In Section 3 we find that if the vertices of the lattice are labeled uniformly
and independently then, up to constants, the asymptotic threshold of $r$ for identifiability is $\log(n)^{1/d}$.


Theorem 1.1. For $\mathbb{Z}_n^d$ with vertex labels i.i.d. uniform from fixed $q$ labels and taking limits as
$n \to \infty$, if for some $\varepsilon > 0$,

$$r^d < (1 - \varepsilon)\,\frac{d \log n}{\log q},$$

then the probability of identifiability from r-neighborhoods tends to zero, and if for some $\varepsilon > 0$,

$$r^d > (1 + \varepsilon)\,2d\,\frac{\log n}{\log q},$$

then the probability of identifiability from r-neighborhoods tends to one.


We conjecture that: 

Conjecture 1.2. There exists a constant $c_d(q)$ such that for every $\varepsilon > 0$, when $r^d > (1 + \varepsilon) c_d(q) \log n$,
the probability of identifiability goes to 1 as $n \to \infty$, while when $r^d < (1 - \varepsilon) c_d(q) \log n$,
the probability of identifiability goes to 0.

More ambitiously we can ask: 

Question 1.3. Does there exist a constant $c_d$ such that for every $\varepsilon > 0$, when $r^d > (1 + \varepsilon) c_d \frac{\log n}{\log q}$,
the probability of identifiability goes to 1 as $n \to \infty$, while when $r^d < (1 - \varepsilon) c_d \frac{\log n}{\log q}$, the probability
of identifiability goes to 0?

In both cases finding the value of the constant, $c_d(q)$ or $c_d$, is a challenging open problem. The
case of non-uniform labels is also discussed in Section 3.
Example 2: Erdos-Renyi graphs. The results of Section 4 show that for $\lambda \neq 1$, the asymptotic
threshold for identifiability in the sparse Erdos-Renyi random graph is $\log(N)$ (up to constants).

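Theorem 1.4. For the Erdos-Renyi graph on $N$ vertices with $p_N = \lambda/N$ and taking limits as
$N \to \infty$, if for some $\varepsilon > 0$

$$\frac{r}{\log(N)} < \frac{1}{2(\lambda - \log(\lambda))} - \varepsilon,$$

then the probability of identifiability from r-neighborhoods tends to zero.

• If $\lambda < 1$ and for some $\varepsilon > 0$,

$$\frac{r}{\log(N)} > \frac{1}{\log(1/\lambda)} + \varepsilon,$$

then the probability of identifiability from r-neighborhoods tends to one.

• If $\lambda > 1$ and $\lambda_* < 1$ is the unique solution to $\lambda e^{-\lambda} = \lambda_* e^{-\lambda_*}$, and for some $\varepsilon > 0$,

$$\frac{r}{\log(N)} > \frac{1}{\log(\lambda)} + \frac{2}{\log(1/\lambda_*)} + \varepsilon,$$

then the probability of identifiability from r-neighborhoods tends to one.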
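The constants in Theorem 1.4 are easy to evaluate numerically. The sketch below (function names hypothetical) finds the conjugate parameter $\lambda_* < 1$ solving $\lambda e^{-\lambda} = \lambda_* e^{-\lambda_*}$ by bisection, using that $x \mapsto x e^{-x}$ is increasing on $[0, 1]$, and then returns the two constants multiplying $\log(N)$: $1/(2(\lambda - \log\lambda))$ for non-identifiability, and $1/\log(1/\lambda)$ or $1/\log(\lambda) + 2/\log(1/\lambda_*)$ for identifiability. It takes these constants as read from the theorem statement and ignores the $\varepsilon$ slack.

```python
import math

def conjugate_lambda(lam, tol=1e-12):
    """For lam > 1, return lam_star in (0, 1) with lam*exp(-lam) == lam_star*exp(-lam_star)."""
    if lam <= 1:
        raise ValueError("conjugate parameter is only defined here for lam > 1")
    target = lam * math.exp(-lam)
    lo, hi = 0.0, 1.0  # x * exp(-x) is increasing on [0, 1]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid * math.exp(-mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def radius_constants(lam):
    """Return (c_lower, c_upper): r < c_lower*log(N) is too small, r > c_upper*log(N) suffices."""
    c_lower = 1.0 / (2.0 * (lam - math.log(lam)))
    if lam < 1:
        c_upper = 1.0 / math.log(1.0 / lam)
    else:
        lam_star = conjugate_lambda(lam)  # raises ValueError in the critical case lam == 1
        c_upper = 1.0 / math.log(lam) + 2.0 / math.log(1.0 / lam_star)
    return c_lower, c_upper
```

For any $\lambda \neq 1$ the two constants differ, which is the gap that the conjecture following the theorem asks to close.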

For $\lambda = 1$, the second statement of Theorem 4.2 below implies that if $r N^{-1/3} \to \infty$, then the
probability of identifiability tends to one, but this is far from the lower bound $\log(N)$ provided by
the previous result. We make the following conjecture:

Conjecture 1.5. For positive $\lambda \neq 1$, there exists a constant $c_\lambda$ such that for every $\varepsilon > 0$, when $r >
(1 + \varepsilon) c_\lambda \log N$, the probability of identifiability tends to 1 as $N \to \infty$, while when $r < (1 - \varepsilon) c_\lambda \log N$,
the probability of identifiability goes to 0.

Natural open problems are to prove the conjecture, find the value of $c_\lambda$, and also to better
understand the critical case where $\lambda = 1$. The cases of sparse Erdos-Renyi with labels and Erdos-Renyi
with unbounded degrees are also studied in Section 4. In particular, in the most technical
result in the paper we show that if $p_N = \omega(\log(N)^2/N)$ then neighborhoods of size 3 are enough
to ensure identifiability:

Theorem 1.6. If $\mathcal{G}$ is the Erdos-Renyi random graph with $N$ vertices and edge probability $p_N$
satisfying $N p_N / \log(N)^2 \to \infty$ as $N \to \infty$ and we are given $\mathcal{N}_3(v)$ for each vertex $v$ in $\mathcal{G}$, then the
probability of identifiability tends to one.

Example 3: Jigsaw puzzle. In Propositions 5.1 and 5.2 we show that if $q = o(n^{2/3})$, then the
probability of identifiability tends to zero and if $q = \omega(n^2)$, then the probability of identifiability
tends to one. We do not believe that either the constant 2/3 or the constant 2 is sharp but
conjecture there is a critical exponent:

Conjecture 1.7. For the jigsaw puzzle problem, there exists a constant $c$ such that for all $\varepsilon > 0$ if

• $q < n^{c - \varepsilon}$ then the probability of identification goes to 0 as $n \to \infty$ and if

• $q > n^{c + \varepsilon}$ then the probability of identification goes to 1 as $n \to \infty$.

A number of additional open problems and conjectures are given in each section and we conclude
the paper with a summary of these and other outstanding questions in Section 6. In this paper we
only consider either unlabeled graphs or graphs that have i.i.d. labels. However, we emphasize that 
the questions considered here can be naturally extended to labelings of the graph outside of the i.i.d. 
case. For example the graph may be colored by an Ising model or by a uniform proper coloring. Thus 
the study of graph shotgun assembly raises new problems in random graphs, percolation, Ising/Potts 
models, as well as algorithmic problems regarding random constraint satisfaction problems and the 
theory of spin glasses. 

Except for the case of dense ER random graphs and the DNA shotgun assembly problem, 
none of the graph shotgun results are tight. We conclude the introduction with another family of 
examples for which it is easy to derive tight bounds. 

The labelled full binary tree. Let $T_n$ be the full binary tree with $2^n$ leaves and label each
vertex uniformly from the letters $\{1, \ldots, q\}$. We are given the 1-neighborhoods $\mathcal{N}_1(v)$ of the $2^n - 2$
vertices that are not leaves or the root (so we see the labels of the vertex, its two children, and its
parent).

Proposition 1.8. Let $\varepsilon > 0$. If

$$\frac{\log q}{n} < \log(2) - \varepsilon,$$

then the probability of identifiability of the labeled binary tree $T_n$ from 1-neighborhoods tends to zero.

If

$$\frac{\log q}{n} > \log(2) + \varepsilon,$$

then the probability of identifiability of the labeled binary tree $T_n$ from 1-neighborhoods tends to one.

Proof. To prove the first assertion, note that if there are two vertex disjoint edges between levels 
n — 2 and n — 1 of the tree having endpoints with identical labels, then with good probability 
reconstruction is impossible since we can switch the cherries below these edges (which have different 
labels with good probability) and obtain a non-isomorphic (again with good probability) labeling 
of the tree with the same neighborhoods. Thus we lower bound the probability of this event using 
the second moment method. Actually it’s enough to consider neighborhoods of vertices at level
$n - 1$ which are odd-numbered when labeled sequentially $1, 2, \ldots, 2^{n-1}$ starting from the left. Let
$B = B_{n,q}$ be the number of pairs of such neighborhoods where the central vertices have the same
label and the parent vertices have the same label (possibly different from the central vertices) and
the two pairs of leaves have different labels (as sets). Writing $B = \sum_{(\alpha,\beta)} X_{\alpha,\beta}$, where the sum is
over all such pairs of neighborhoods $(\alpha, \beta)$ and $X_{\alpha,\beta}$ is the indicator of the event just described, we
compute

$$\mathbb{E}B \asymp 2^{2(n-2)} (1/q)^2 (1 - 1/q^2).$$

After noting that the labels being chosen uniformly implies the $X_{\alpha,\beta}$ are independent, we find

$$\mathrm{Var}\,B = \sum_{(\alpha,\beta)} \mathrm{Var}(X_{\alpha,\beta}) \le \mathbb{E}B,$$

and the first claim of the proposition follows by the second moment method.

For the second part of the claim, it’s clear that if no two edges have the same labels, then we
can piece together the tree from the neighborhoods by overlapping distinct edges. The mean of the
number of pairs of edges with the same labels is bounded above by

$$2^{2n+2} (1/q)^2,$$

which tends to zero under the hypothesis of the second statement of the proposition and so the
result follows. □

2 Combinatorial and sampling results 

We introduce two concepts that can be used to determine identifiability: blocking configurations 
and uniqueness of overlaps. For concreteness, specialize to the case where for each vertex $v$, $\mathcal{N}_r(v)$
is the labeled subgraph induced by the vertices at distance no greater than $r$ from $v$.

2.1 Blocking configurations 

A blocking configuration is a neighborhood structure or pattern such that if it appears then
identifiability is impossible. For a given example, there can be a number of different blocking
configurations, though that described in Lemma 2.1 below is most likely in our examples. In random
models, we use blocking configurations to get upper bounds on the asymptotic neighborhood size
to ensure non-identifiability: if the neighborhoods grow too slowly, then the chance that a blocking
configuration appears tends to one and identifiability is impossible (or the probability is bounded
away from zero and so identifiability isn’t assured).

For $t > s > 0$ and vertex $v$, define the sphere (or shell) $S(v; s, t)$ to be the subgraph induced by
edges connecting vertices having distance to $v$ between $s$ and $t$ (inclusive). Note that $S(v; s, t)$ has
no isolated vertices.

Lemma 2.1. If $\mathcal{G}$ is such that there is an $r > 0$ and vertices $v, w$ such that

(i) $S(v; 1, 2r) = S(w; 1, 2r)$,

(ii) $d(v, w) > 2r$, and

(iii) the graph obtained by switching $\mathcal{N}_1(v)$ and $\mathcal{N}_1(w)$ in $\mathcal{G}$ is not isomorphic to $\mathcal{G}$,

then identifiability from r-neighborhoods is impossible.

Proof. We claim that there are at least two non-isomorphic labeled graphs having the same
r-neighborhoods as $\mathcal{G}$: the true one, $\mathcal{G}$, and one where $\mathcal{N}_1(v)$ and $\mathcal{N}_1(w)$ are switched, denoted by
$\mathcal{G}'$. Condition (i) ensures that such a switch is possible since the number of vertices at distance one
connecting to vertices at distance two and their labels agree for $v$ and $w$. Condition (iii) ensures
that $\mathcal{G}$ and $\mathcal{G}'$ are not isomorphic (and note in particular that this implies $\mathcal{N}_1(v) \neq \mathcal{N}_1(w)$). Denote
by $\mathcal{N}'_r$ the r-neighborhoods generated by $\mathcal{G}'$.

We only need to show that $\mathcal{G}$ and $\mathcal{G}'$ generate the same r-neighborhoods (including
multiplicities). From (ii), there is no vertex having both $v$ and $w$ in its $\mathcal{G}$ r-neighborhood. Thus we can
split vertices into two groups: those being within distance $r$ of exactly one of $v$ or $w$ in $\mathcal{G}$ and
those having distance greater than $r$ from both of $v$ and $w$. For any vertex $x$ in the latter group,
the differences in switching $\mathcal{N}_1(v)$ and $\mathcal{N}_1(w)$ are not reflected by (potential) neighbors of $v$ and $w$
that are at distance $r$ from $x$ (since the labels and positions of such vertices have to match), and
so $\mathcal{N}_r(x) = \mathcal{N}'_r(x)$.

For the group of vertices within distance $r$ (in $\mathcal{G}$) of one of $v$ or $w$, Condition (i) implies there
is an obvious matching of each vertex $x$ that satisfies either

• $2 \le d(x, v) \le r$ (distance in $\mathcal{G}$) or,

• $d(x, v) = 1$ and $x$ has a neighbor at distance two from $v$,

to one, $y$, having the same distance from $w$ and identical label. Moreover, under this matching,
$\mathcal{N}_r(x) = \mathcal{N}'_r(y)$ and $\mathcal{N}'_r(x) = \mathcal{N}_r(y)$. Finally, by (i), for $x = v, w$ or a neighbor of $v$ or $w$ with
no neighbors at distance 2 from $v$ or $w$, $\mathcal{N}_r(x) = \mathcal{N}'_r(x)$. Thus $\mathcal{G}$ and $\mathcal{G}'$ generate the same
r-neighborhoods. □

Remark 2.2. Condition (iii) seems a bit unnatural and possibly hard to verify. Indeed, it is 
difficult to check in situations where the graph G has many symmetries since the graph isomorphism 
problem is computationally difficult. However, such symmetry is rare in random graphs and so in
our applications of the lemma, Condition (iii) is easy to verify. We also note that the condition is
reasonable to impose given the difficulty of the “reconstruction conjecture” that has been open for 
more than 50 years. 

2.2 Uniqueness of overlaps 

The next result formalizes the intuition that if all of the neighborhoods of a certain size are unique, 
then slightly larger neighborhoods are enough to ensure identifiability. In random models, we 
use uniqueness of overlaps to get lower bounds on the asymptotic neighborhood size to ensure 
identifiability. If the neighborhoods grow quickly enough, then the chance that all neighborhoods 
of a slightly smaller size are unique tends to one and identifiability is ensured. 

Lemma 2.3. If $\mathcal{N}_{r-1}(v) \neq \mathcal{N}_{r-1}(w)$ for all vertices $v \neq w$, then there is an efficient algorithm for
recovering the graph from r-neighborhoods.

Proof. We can sequentially build the network by overlapping neighborhoods of radius $r - 1$. Start
with some r-neighborhood $\mathcal{N}_r(v)$ and note that the $(r - 1)$-neighborhood of each neighbor of $v$
is contained in $\mathcal{N}_r(v)$ and these are all unique by assumption. Thus for each vertex $w \neq v$, we
examine the $(r - 1)$-neighborhoods of neighbors of $w$ and overlap any of these matching the
$(r - 1)$-neighborhoods of neighbors of $v$. Repeating this process for each neighbor of $v$ and then continuing
for the vertices at distance $2, 3, \ldots$ from $v$, it’s clear that the process terminates when a connected
component is recovered. □
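The overlap idea is easiest to see in the one-dimensional (DNA-like) setting, where a neighborhood is a read of $r$ consecutive letters. The following Python sketch is an analogue of the lemma rather than the general graph algorithm: it assumes every substring of length $r - 1$ of the unknown string is distinct and then chains reads by matching the last $r - 1$ letters of one read to the first $r - 1$ letters of the next. The function name `assemble_from_reads` is hypothetical.

```python
def assemble_from_reads(reads):
    """Rebuild a string from the collection of all its length-r substrings.

    Assumes r >= 2 and that all (r-1)-substrings of the unknown string are
    distinct (the 1-D analogue of unique (r-1)-neighborhoods); raises
    ValueError when this uniqueness visibly fails.
    """
    r = len(reads[0])
    by_prefix = {}
    for read in reads:
        prefix = read[: r - 1]
        if prefix in by_prefix:
            raise ValueError("repeated (r-1)-substring: overlaps are not unique")
        by_prefix[prefix] = read
    suffixes = {read[1:] for read in reads}
    # The first read is the only one whose prefix is not the suffix of another read.
    starts = [read for read in reads if read[: r - 1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("could not identify a unique starting read")
    current = starts[0]
    assembled = current
    used = 1
    while current[1:] in by_prefix:
        current = by_prefix[current[1:]]
        assembled += current[-1]  # each overlap contributes one new letter
        used += 1
        if used > len(reads):
            raise ValueError("reads form a cycle")
    if used != len(reads):
        raise ValueError("some reads could not be placed")
    return assembled

sequence = "ACGTTGCA"
reads = sorted(sequence[i : i + 4] for i in range(len(sequence) - 3))  # order forgotten
recovered = assemble_from_reads(reads)
```

When a substring of length $r - 1$ repeats, the chain of overlaps is ambiguous and the sketch refuses to guess.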

Remark 2.4. The proof of the lemma is simple because we assume we see not only $\mathcal{N}_r(v)$, but
also which vertex in the neighborhood is the “center” (namely, v). We do not investigate here how 
to relax this condition to the situation where the center v is not given. 

2.3 Sampling 

In the regime where we have uniqueness of $(r - 1)$-neighborhoods, the coupon collector problem
yields bounds on the probability of reconstruction. Let $M_{\mathrm{rec}}(N, r, \varepsilon)$ be the minimum number of
samples so that the chance the graph can be reconstructed from the samples is at least $1 - \varepsilon$.

Lemma 2.5. If for some $r$, $\mathcal{N}_{r-1}(v) \neq \mathcal{N}_{r-1}(w)$ for all vertices $v \neq w$, then

$$M_{\mathrm{rec}}(N, r, \varepsilon) \le \lceil N \log(N) - N \log \varepsilon \rceil.$$

Proof. The proof of Lemma 2.3 implies that it’s enough to see all of the neighborhoods, possibly
in multiplicities, since then we can build the network by overlapping the $(r - 1)$-neighborhoods
of neighbors of the sampled vertex. The bound in the lemma now easily follows from coupon
collecting; if $T$ is the number of samples with replacement required to collect $N$ distinct coupons,
then a union bound implies that for integer $M > 0$,

$$P(T > M) \le N(1 - 1/N)^M \le N e^{-M/N}.$$

Now setting $M = \lceil N \log(N) - N \log \varepsilon \rceil$, we find

$$P(\text{Can’t reconstruct with } M \text{ samples}) \le P(T > M) \le \varepsilon,$$

and so $M_{\mathrm{rec}}(N, r, \varepsilon) \le M$. □
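The coupon-collecting bound can be checked numerically. The sketch below (function names hypothetical) computes $M = \lceil N \log N - N \log \varepsilon \rceil$, evaluates the union bound $N(1 - 1/N)^M$, and estimates by simulation how often $M$ uniform samples with replacement miss at least one of the $N$ neighborhoods.

```python
import math
import random

def sample_bound(N, eps):
    """Sample size M = ceil(N log N - N log eps) from the coupon-collector argument."""
    return math.ceil(N * math.log(N) - N * math.log(eps))

def union_bound(N, M):
    """Upper bound N (1 - 1/N)^M on the chance that some neighborhood is never sampled."""
    return N * (1 - 1 / N) ** M

def miss_frequency(N, M, trials=500, seed=0):
    """Monte Carlo estimate of P(T > M): fraction of trials missing some neighborhood."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        seen = {rng.randrange(N) for _ in range(M)}
        if len(seen) < N:
            misses += 1
    return misses / trials

M = sample_bound(100, 0.1)
```

In such experiments the simulated miss frequency sits below the union bound, which in turn is at most $\varepsilon$.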


Since there is no hope of reconstruction if there is some vertex that doesn’t appear in any of the
sampled neighborhoods, we can also use coupon collecting to get a lower bound on $M_{\mathrm{rec}}(N, r, \varepsilon)$ in
the general case. Let $|\mathcal{N}_r(v)|$ denote the number of vertices in $\mathcal{N}_r(v)$.


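Lemma 2.6. If $M$ is such that

$$\frac{\left[\sum_{i=1}^{N} \left(1 - \frac{|\mathcal{N}_r(v_i)|}{N}\right)^{M}\right]^{2}}{\sum_{i,j=1}^{N} \left(1 - \frac{|\mathcal{N}_r(v_i) \cup \mathcal{N}_r(v_j)|}{N}\right)^{M}} > \varepsilon,$$

then $M_{\mathrm{rec}}(N, r, \varepsilon) > M$.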

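Proof. Let $W_M$ be the number of vertices that have not appeared in some neighborhood in a sample
of size $M$. If $W_M > 0$, then we can’t reconstruct with $M$ samples and so by the second moment
method,

$$P(\text{Can’t reconstruct with } M \text{ samples}) \ge P(W_M > 0) \ge \frac{(\mathbb{E}W_M)^2}{\mathbb{E}W_M^2}, \tag{2.1}$$

and for any $M$ such that the right-most side of (2.1) is greater than $\varepsilon$, the chance of reconstruction
is at most $1 - \varepsilon$ which implies $M < M_{\mathrm{rec}}(N, r, \varepsilon)$. The result now follows by computing

$$\mathbb{E}W_M = \sum_{i=1}^{N} \left(1 - \frac{|\mathcal{N}_r(v_i)|}{N}\right)^{M}, \qquad \mathbb{E}W_M^2 = \sum_{i,j=1}^{N} \left(1 - \frac{|\mathcal{N}_r(v_i) \cup \mathcal{N}_r(v_j)|}{N}\right)^{M}.$$

□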
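The second-moment ratio in this argument only depends on the sizes of neighborhoods and of their pairwise unions, so it is simple to evaluate for a concrete graph. The sketch below assumes the neighborhoods are distance balls (so that $v \in \mathcal{N}_r(u)$ exactly when $u \in \mathcal{N}_r(v)$) and uses a cycle as a toy example; the function names are hypothetical.

```python
def second_moment_ratio(neighborhoods, M):
    """Return (E W_M)^2 / E W_M^2 for a list of vertex sets, one ball per vertex.

    Assumes every ball is a proper subset of the vertex set, so the
    denominator is strictly positive.
    """
    N = len(neighborhoods)
    first = sum((1 - len(S) / N) ** M for S in neighborhoods)
    second = sum((1 - len(S | T) / N) ** M for S in neighborhoods for T in neighborhoods)
    return first ** 2 / second

def cycle_balls(N, r):
    """Radius-r balls on the cycle with N vertices (requires 2r + 1 < N)."""
    return [{(v + k) % N for k in range(-r, r + 1)} for v in range(N)]

def largest_insufficient_sample(neighborhoods, eps, M_max=10_000):
    """Largest M <= M_max whose ratio exceeds eps, or 0 if there is none."""
    best = 0
    for M in range(1, M_max + 1):
        if second_moment_ratio(neighborhoods, M) > eps:
            best = M
    return best

balls = cycle_balls(50, 2)
ratio = second_moment_ratio(balls, 20)
```

Any $M$ for which the ratio exceeds $\varepsilon$ is, by the lemma, too small a sample to reconstruct with probability $1 - \varepsilon$.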


3 Labeled lattice models 

Recall the setting of Example 1: $\mathcal{G}$ is the $d \ge 2$ dimensional $n$-box $\mathbb{Z}_n^d$ with i.i.d. vertex labels and
neighborhoods the $r$-boxes contained in $\mathbb{Z}_n^d$; note that for these neighborhoods the position of $v$ is
irrelevant. Our results for i.i.d. uniform labeling are different than the general i.i.d. case.


3.1 Uniform labels 


Assume the vertices of $\mathbb{Z}_n^d$ are labeled uniformly from $q \ge 2$ labels. Our first result uses blocking
configurations to obtain an upper bound on the growth of $r$ to ensure a positive chance of
non-identifiability.


Proposition 3.1. Given the r-neighborhoods of $\mathbb{Z}_n^d$ with vertex labels i.i.d. uniform from $q$ labels,
the following holds as $n \to \infty$.

• if $(n/r)^{2d} q^{-(2r-1)^d} \to \infty$, then the probability of identifiability tends to zero, and

• if $\liminf_{n \to \infty} (n/r)^{2d} q^{-(2r-1)^d} > 0$, then the probability of identifiability is strictly less than one.










Proof. We lower bound the probability of the following blocking configurations given by a pair of
non-overlapping $(2r - 1)$-neighborhoods that have identical labels except for the two center vertices
which are different. We only consider neighborhoods of the form $x + [0, 2r - 1]^d$ where all of the
coordinates of $x$ are 0 modulo $2r - 1$. Similar to Lemma 2.1, if two such neighborhoods exist,
then identifiability is impossible since there are at least two ways to construct a consistent layout of 
neighborhoods, by switching the labels of the center vertices. Note further that the probability that 
there is an isomorphism of the graph excluding these two neighborhoods is at most $2^d \times (1/q)^{n^d}$
(since there are $2^d$ possible rotations and each site has to match the label of one other site).

To establish the existence of the neighborhood pair, we use the second moment method. Let
$B = B_{n,d,r,q}$ denote the number of such blocking configurations described above and we compute
$\mathbb{E}B$ and $\mathbb{E}B^2$. Assume that $n \gg r$ (without loss under the hypotheses of the proposition) and
denote the set of such $(2r - 1)$-neighborhoods of $\mathbb{Z}_n^d$ by $\Gamma = \Gamma_{n,d,2r-1}$ (note that $|\Gamma| = \Theta((n/2r)^d)$)
and write

$$B = \sum_{\alpha, \beta \in \Gamma,\ \alpha \cap \beta = \emptyset} X_{\alpha,\beta},$$

where $X_{\alpha,\beta}$ is the indicator of the event that the labels of $\alpha$ and $\beta$ are equal except for the center
labels which must be different and $\alpha \cap \beta = \emptyset$ means $\alpha$ and $\beta$ are non-overlapping. It’s easy to see
that $\mathbb{E}X_{\alpha,\beta} = (1/q)^{(2r-1)^d - 1}(1 - 1/q)$, which implies that

$$\mathbb{E}B \ge \Theta((n/2r)^{2d}) (1/q)^{(2r-1)^d - 1} (1 - 1/q). \tag{3.1}$$

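The fact that $B$ is concentrated follows from the fact that the $X_{\alpha,\beta}$ are pairwise independent:
if the labels are chosen uniformly, then for two pairs of neighborhoods $(\alpha, \beta) \neq (\gamma, \delta)$, $X_{\alpha,\beta}$ and
$X_{\gamma,\delta}$ are independent. Thus

$$\mathrm{Var}(B) = \sum_{\alpha, \beta \in \Gamma,\ \alpha \cap \beta = \emptyset} \mathrm{Var}(X_{\alpha,\beta}) \le \mathbb{E}B.$$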

Now the proof follows by the second moment method. □ 

We can use uniqueness of overlaps as in Lemma 2.3 to find a regime where asymptotic
reconstruction is assured.

Proposition 3.2. If $n^{2d} q^{-(r-1)^d} \to 0$ as $n \to \infty$, then the probability of identifiability (of $\mathbb{Z}_n^d$ with
i.i.d. uniform on $q$ vertex labels) from r-neighborhoods tends to one.

Proof. Let $Y := Y_{n,d,r,q}$ be the number of pairs of different $(r - 1)$-neighborhoods that have the
same labels and we show that $\mathbb{E}Y \to 0$ as $n \to \infty$, from which the result follows from a minor
variation of Lemma 2.3.

Similar to the proof of Proposition 3.1, denote the set of $(r - 1)$-neighborhoods of $\mathbb{Z}_n^d$ by
$\Gamma = \Gamma_{n,d,r-1}$ and for $\alpha, \beta \in \Gamma$, let $Y_{(\alpha,\beta)}$ be the indicator that $\alpha$ and $\beta$ have the same labels. It’s
obvious that if $\alpha \cap \beta = \emptyset$ (meaning the two neighborhoods share no vertices) then $\mathbb{E}Y_{(\alpha,\beta)} = q^{-(r-1)^d}$,
but since the labels are uniform, straightforward considerations (see below) show that in fact

$$\mathbb{E}Y_{(\alpha,\beta)} = q^{-(r-1)^d} \tag{3.2}$$

for all $\alpha \neq \beta$. Thus we find

$$\mathbb{E}Y = \sum_{\alpha \neq \beta} \mathbb{E}Y_{(\alpha,\beta)} = \left[(n - r)^{2d} - (n - r)^{d}\right] q^{-(r-1)^d}.$$

To prove (3.2) formally assume WLOG that $(x, y) \mapsto (x, y) - (i, j)$ is an injective map from $\beta$ to
$\alpha$ where $i, j \ge 0$ and at least one of $i$ and $j$ is non-zero. Then we can label $\alpha \cup \beta$ according to
lexicographic order where

• If a site is in $\alpha \setminus \beta$ then we label it arbitrarily.

• If it is in $\beta$ then we label it by looking at the site $(x, y) - (i, j)$ which was already labeled.

This defines all labelings of $\alpha \cup \beta$ where $\alpha$ and $\beta$ have the same labels, so the number of such
labelings is $q^{|\alpha \cup \beta| - (r-1)^d}$ while the total number of labelings of $\alpha \cup \beta$ is $q^{|\alpha \cup \beta|}$. The proof follows. □
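As a rough numerical companion to this first-moment argument, the sketch below evaluates the logarithm of $m(m - 1) q^{-(r-1)^d}$ with $m = (n - r)^d$. It assumes that $m$ is the number of $(r - 1)$-cubes, that ordered pairs of distinct cubes are counted, and that each pair matches with the uniform-label probability $q^{-(r-1)^d}$ discussed above. The function names and the tolerance `delta` are hypothetical.

```python
import math

def log_expected_collisions(n, d, r, q):
    """log of m(m-1) * q^{-(r-1)^d}, where m = (n-r)^d counts the (r-1)-cubes.

    Requires m >= 2. Works in logs to avoid overflow for large n.
    """
    m = (n - r) ** d
    return math.log(m) + math.log(m - 1) - (r - 1) ** d * math.log(q)

def smallest_radius(n, d, q, delta=0.01):
    """Smallest r >= 2 (searched up to n - 2) with expected collisions at most delta."""
    r = 2
    while r < n - 2 and log_expected_collisions(n, d, r, q) > math.log(delta):
        r += 1
    return r

r_star = smallest_radius(1000, 1, 4)
```

For fixed $q$ and $d$ the radius returned grows like $(\log n)^{1/d}$, in line with the threshold discussed in the introduction.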

Theorem 1.1 in the introduction is easily established by combining Propositions 3.1 and 3.2.

3.2 Non-uniform labels

If the labels are i.i.d. but not uniform, we can prove a (weaker) analog of Proposition 3.2. Let
$p_i$ denote the chance of label $i$ appearing at a site and $\nu_j = \sum_{i=1}^{q} p_i^j$ denote the probability that $j$
particular sites have the same label.

Proposition 3.3. If […] $\to 0$ as $n \to \infty$, then the probability of identifiability (of $\mathbb{Z}_n^d$
with i.i.d. vertex labels) from r-neighborhoods tends to one.

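Proof. As in the proof of Proposition 3.2, let $Y$ be the number of pairs of $(r - 1)$-neighborhoods that have
the same labels and we show that $\mathbb{E}Y \to 0$ as $n \to \infty$. Similar to the proof of Proposition 3.1,
denote the set of $(r - 1)$-neighborhoods of $\mathbb{Z}_n^d$ by $\Gamma = \Gamma_{n,d,r-1}$ and for $\alpha, \beta \in \Gamma$, let $Y_{(\alpha,\beta)}$ be the
indicator that $\alpha$ and $\beta$ have the same labels. It’s obvious that if $\alpha \cap \beta = \emptyset$ (meaning the two
neighborhoods share no vertices) then $\mathbb{E}Y_{(\alpha,\beta)} = \nu_2^{(r-1)^d}$. If $\alpha \cap \beta \neq \emptyset$, then

$$\mathbb{E}Y_{(\alpha,\beta)} = \prod_{j \ge 2} \nu_j^{k_j}, \tag{3.3}$$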

where $j \times k_j$ is the number of sites in the union of $\alpha$ and $\beta$ that need to be matched to $j - 1$ other
sites to ensure $Y_{(\alpha,\beta)} = 1$ (c.f., the justification of (3.2) at the end of the proof of Proposition 3.2).
Note that $\sum_{j \ge 2} (j - 1) k_j = (r - 1)^d$ and that $\sum_{j \ge 2} k_j = |\alpha \cup \beta| - (r - 1)^d$, since this sum is equal
to $|\alpha \setminus \beta|$. Using the basic inequality $\nu_j \le \nu_2^{j/2}$ for $j \ge 2$ in (3.3), we find

$$\mathbb{E}Y_{(\alpha,\beta)} \le \prod_{j \ge 2} \nu_2^{j k_j / 2} = \nu_2^{|\alpha \cup \beta| / 2} \le \nu_2^{(r-1)^d / 2};$$

the last inequality is since $|\alpha \cup \beta| \ge (r - 1)^d$. Counting the number of overlapping and
non-overlapping neighborhoods, we find

from which the result easily follows. □ 

Remark 3.4. If the labels are uniform, then $\nu_j = q^{-(j-1)}$ and so we can use this exact quantity
(rather than the inequality $\nu_j \le \nu_2^{j/2}$) in (3.3) in the proof of Proposition 3.3 to recover the sharper
Proposition 3.2.

For non-uniform vertex labels, the correlations between the appearance of overlapping blocking
sets can become significant and so the second moment method of Proposition 3.1 breaks down.
Still we believe that similar results should hold: 

Conjecture 3.5. Consider a distribution $\pi$ that is fully supported on $\{1, \ldots, q\}$ and the labeling of
$\mathbb{Z}_n^d$ by i.i.d. labels from $\pi$. For every dimension $d$, there exists a constant $c_d(\pi)$ such that for every
$\varepsilon > 0$, when $r^d > (1 + \varepsilon) c_d(\pi) \log n$, the probability of identifiability tends to one as $n \to \infty$, while
when $r^d < (1 - \varepsilon) c_d(\pi) \log n$, the probability of identifiability goes to 0.

We believe that Conjecture 3.5 should also extend to some dependent setups including:

• The uniform distribution of legal vertex colorings of a box with q > 3d colors. We require 
that $q$ is large to ensure correlation decay of the distribution. Note for example that if $q = 2$
and $d \ge 2$, then the problem is degenerate as there are only two possible colorings of the
graph. 

• The Ising and Potts models with finite temperature $0 < \beta < \infty$ in the box.

Proving the conjectures and establishing the value of the threshold in these examples are
fascinating open problems.


3.3 Sampling 

If $\mathbb{Z}_n^d$ has uniqueness of $(r - 1)$-overlaps (asymptotically assured in the regimes of Propositions 3.2
and 3.3), then the argument of Lemma 2.5 automatically implies an upper bound of $N(\log(N) -
\log(\varepsilon))$ (recall $N = N_{n,d,r} := (n - r - 1)^d$ is the number of neighborhoods) on $M_{\mathrm{rec}}(N, \varepsilon, r)$, the
minimum number of samples needed to reconstruct the labels of the lattice with probability at
least $1 - \varepsilon$. We can also use Lemma 2.6 to show that we need at least of order (large $N$, small $\varepsilon$)
$\frac{N}{r^d}\left(\log(N/r^d) - \log(\varepsilon)\right)$ samples to reconstruct in any regime.

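Proposition 3.6. For $\mathbb{Z}_n^d$ with vertex labels,

$$M_{\mathrm{rec}}(N, \varepsilon, r) \ge \frac{\log\left(\frac{1}{\varepsilon} - 1\right) - \log\left(\frac{(2r-1)^d}{N}\right)}{-\log\left(1 - \frac{r^d}{N}\right)}. \tag{3.4}$$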


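Proof. We may use Lemma 2.6 with this neighborhood structure since its argument only relies
on the size (and not the structure) of the neighborhoods. First note $|\mathcal{N}_r(v)| = r^d$ for all $v$ and
$|\mathcal{N}_r(v) \cup \mathcal{N}_r(w)| = 2r^d$ if $\mathcal{N}_r(v) \cap \mathcal{N}_r(w) = \emptyset$ and $|\mathcal{N}_r(v) \cup \mathcal{N}_r(w)| \ge r^d$ otherwise. Using these
bounds, if $M$ is no greater than the right hand side of (3.4), then

$$\frac{\left[\sum_{i=1}^{N} \left(1 - \frac{|\mathcal{N}_r(v_i)|}{N}\right)^{M}\right]^{2}}{\sum_{i,j=1}^{N} \left(1 - \frac{|\mathcal{N}_r(v_i) \cup \mathcal{N}_r(v_j)|}{N}\right)^{M}} \ge \frac{\left(1 - \frac{r^d}{N}\right)^{2M}}{\left(1 - \frac{2r^d}{N}\right)^{M} + \frac{(2r-1)^d}{N}\left(1 - \frac{r^d}{N}\right)^{M}} \ge \left[1 + \frac{(2r-1)^d}{N}\left(1 - \frac{r^d}{N}\right)^{-M}\right]^{-1} \ge \varepsilon,$$

and the result follows. □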

4 Erdos-Renyi graph


Assume the setup of Example 2: $\mathcal{G}$ is the Erdos-Renyi random graph with $N$ vertices, the vertices
have no labels (or to fit our setup, all labels are the same) and for each vertex $v$, we have the
r-neighborhoods $\mathcal{N}_r(v)$ which are the subgraphs induced by vertices at distance $\le r$ from $v$. This
example fits exactly into our general setup and so Lemmas 2.1 and 2.3 can be applied “out of the
box”. As is typical for Erdos-Renyi random graphs, the results differ if the graph has bounded
average degree or not and so we separate our results accordingly into Sections 4.1 and 4.2.

4.1 Bounded average degree Erdos-Renyi 

Let $\mathcal{G}$ be the Erdos-Renyi random graph with $N$ vertices and edge probability $p_N = \lambda/N$ for some
$\lambda > 0$. We use the blocking configuration of Lemma 2.1 to show the following result.

Proposition 4.1. For the Erdos-Renyi graph on $N$ vertices with $p_N = \lambda/N$, using the notation of
the previous paragraph, and taking limits as $N \to \infty$,

• if $\sqrt{N} \lambda^r (1 - \lambda/N)^{rN} \to \infty$, then the probability of identifiability tends to zero, and

• if $\liminf_{N \to \infty} \sqrt{N} \lambda^r (1 - \lambda/N)^{rN} > 0$, then the probability of identifiability is strictly less than one.

Proof. Note that $\lambda(1 - \lambda/N)^N < 1$ and so if $r$ grows faster than $\log(N)$, then neither of the
hypotheses of the proposition are satisfied, and so we can assume without loss that $r/N^{\alpha} \to 0$ for
all $\alpha > 0$. We lower bound the probability of the appearance of the following blocking (induced)
subgraph on $4r + 6$ vertices: the subgraph has two components, one a line graph on $2r + 1$ vertices
and the other a line graph on $2r + 1$ vertices with the addition of both end vertices being connected
to two other vertices with no other edges to form “prongs”; see Figure 1.



Figure 1: Example of blocking subgraph for neighborhoods of radius $r$. The line graph has $2r + 1$
vertices.

Note that this blocking set satisfies the hypotheses of Lemma 2.1 by taking $v$ to be an endpoint
of the line graph and $w$ to be one of the degree three vertices. Alternatively, it’s easy to see that if
such a subgraph is present, then identifiability is impossible because there are at least two ways to
construct the graph consistent with the neighborhoods, by switching one of the prongs to the line
graph; see Figure 2 for illustration.



Figure 2: A subgraph that has the same $r$-neighborhoods as that of Figure 1.
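The switching argument can be checked mechanically for small radii. The following is a minimal sketch, assuming that neighborhoods are rooted at their center vertex and compared up to rooted isomorphism; the function names (`blocking_pair`, `rooted_ball`, and so on) and the vertex naming scheme are illustrative choices, not notation from the text. Because both graphs are forests, the radius-r ball around a vertex is a rooted tree and can be encoded canonically by sorting the encodings of its subtrees.

```python
from collections import Counter

def add_path(adj, names):
    """Add the edges of a path visiting `names` in order to the adjacency dict."""
    for a, b in zip(names, names[1:]):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

def blocking_pair(r):
    """Return adjacency dicts for the two graphs described in Figures 1 and 2."""
    line_a = [("a", i) for i in range(2 * r + 1)]
    line_b = [("b", i) for i in range(2 * r + 1)]
    fig1, fig2 = {}, {}
    for g in (fig1, fig2):
        add_path(g, line_a)
        add_path(g, line_b)
        # The prong at the start of line b is common to both graphs.
        add_path(g, [("x", 1), line_b[0], ("x", 2)])
    # Second prong: far end of line b (Figure 1) or start of line a (Figure 2).
    add_path(fig1, [("y", 1), line_b[-1], ("y", 2)])
    add_path(fig2, [("y", 1), line_a[0], ("y", 2)])
    return fig1, fig2

def rooted_ball(adj, root, r):
    """Canonical string for the radius-r ball around `root` (graph must be a forest)."""
    def encode(v, parent, depth):
        if depth == r:
            return "()"
        kids = sorted(encode(u, v, depth + 1) for u in adj[v] if u != parent)
        return "(" + "".join(kids) + ")"
    return encode(root, None, 0)

def neighborhood_multiset(adj, r):
    """Multiset of rooted r-neighborhoods, one per vertex."""
    return Counter(rooted_ball(adj, v, r) for v in adj)

def component_sizes(adj):
    """Sorted list of connected component sizes."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            v = stack.pop()
            size += 1
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        sizes.append(size)
    return sorted(sizes)
```

Comparing `neighborhood_multiset(fig1, r)` with `neighborhood_multiset(fig2, r)` while the outputs of `component_sizes` differ illustrates why the neighborhoods alone cannot distinguish the two graphs.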

Let $B$ be the number of such (induced) subgraphs of $\mathcal{G}$ and write $B = \sum_{\alpha \in \Gamma} X_\alpha$, where
$\Gamma = \Gamma_{N,4r+6}$ is the collection of subsets of vertices of size $4r + 6$ and for $\alpha \in \Gamma$, $X_\alpha$ is the indicator
that the blocking subgraph of Figure 1 is the induced subgraph of $\mathcal{G}$ on $\alpha$. The $X_\alpha$ are equally
distributed and for $\alpha \ne \beta$, if $\alpha \cap \beta \ne \emptyset$, then $X_\alpha X_\beta = 0$. Thus we find for (say) $\alpha = \{1, \ldots, 4r + 6\}$
and $\beta = \{4r + 7, \ldots, 8r + 12\}$,


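$$\mathbb{E}B = \binom{N}{4r+6}\mathbb{E}X_\alpha, \qquad \mathbb{E}B^2 = \mathbb{E}B\left(1 + \binom{N - 4r - 6}{4r + 6}\mathbb{E}[X_\beta \mid X_\alpha = 1]\right).$$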
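As a reminder of why the first two moments of the count of blocking subgraphs are the relevant quantities, the second moment method rests on the following standard consequence of the Cauchy-Schwarz inequality, written here with $B$ denoting that count:

```latex
\[
  \mathbb{P}(B > 0) \;\ge\; \frac{(\mathbb{E}B)^2}{\mathbb{E}[B^2]},
  \qquad\text{since}\qquad
  (\mathbb{E}B)^2
  = \bigl(\mathbb{E}\bigl[B\,\mathbf{1}\{B > 0\}\bigr]\bigr)^2
  \le \mathbb{E}[B^2]\,\mathbb{P}(B > 0).
\]
```

A lower bound on the ratio of the squared mean to the second moment therefore bounds from below the probability that at least one blocking subgraph is present.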


From this point we need to compute $\mathbb{E}X_\alpha$ and $\mathbb{E}[X_\beta \mid X_\alpha = 1]$. There is at most one copy of the
blocking (induced) subgraph on $\alpha$, but there are a number of ways the subgraph can appear. By
enumerating these ways and noting that each has the same chance of appearing, we find


EX« 


()) } 1) ("'} ') ()) 


(4.1) 


Here the first binomial coefficient counts the number of ways of assigning $2r + 1$ vertices of $\alpha$ to the
line graph, the second assigns four of the remaining vertices to the prongs, and for each of the
$(2r + 1)$-lines, there are $(2r + 1)!/2$ ways to put them in order; the final factor of 2 comes from
assigning the pairs of prong vertices to an end. Once the vertices are assigned, there are $4r + 4$
edges that must appear, each with probability $p_N$, and $2(2r - 1)(N - 3) + 6(N - 2) + 2(N - 4)$
edges that must not appear.

Similarly, given $X_\alpha = 1$, none of the vertices of $\alpha$ have edges connecting to vertices outside of
$\alpha$ and so $X_\beta \mid X_\alpha = 1$ is distributed as $X_\beta$, but on an Erdos-Renyi graph on $N - 4r - 6$ vertices and
chance of edge $p_N$. Thus we use (4.1) but with $N - 4r - 6$ replacing $N$ (except in $p_N$) to find

E|A',|X„ = 1] = (I } ')) (4.2) 

Putting together (4.1) and (4.2) and using that under either of the hypotheses of the proposition,
$r/N^{\alpha} \to 0$ for any $\alpha > 0$, we find

{EBf ^ (X - 4r - 6)^^+%^+^(l - 
ER2 - 8 + (X - 4r - - p^){N-2r){4r+6) ’ 

and under the first hypothesis of the proposition, the numerator and the denominator tend to
infinity at the same rate, and under the second, the numerator on the right hand side stays bounded
away from zero. □
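To get a feel for the scale of radii at which the blocking subgraph stops appearing, one can evaluate the quantity in Proposition 4.1 numerically. The sketch below assumes that quantity is $\sqrt{N}\lambda^r(1 - \lambda/N)^{rN}$ and works in log space to avoid underflow; the function names are illustrative, and the "critical radius" is simply the real value of r at which the quantity equals one, not a threshold claimed by the text.

```python
import math

def log_criterion(n_vertices: int, lam: float, r: float) -> float:
    """log of sqrt(N) * lam**r * (1 - lam/N)**(r*N), computed in log space."""
    return (0.5 * math.log(n_vertices)
            + r * math.log(lam)
            + r * n_vertices * math.log1p(-lam / n_vertices))

def critical_radius(n_vertices: int, lam: float) -> float:
    """Real r at which the criterion equals 1, i.e. its logarithm vanishes."""
    # per_step > 0 because lam * (1 - lam/N)**N < 1 for every lam > 0.
    per_step = -(math.log(lam) + n_vertices * math.log1p(-lam / n_vertices))
    return 0.5 * math.log(n_vertices) / per_step
```

For large N the value returned by `critical_radius` is close to $\log(N) / (2(\lambda - \log \lambda))$, which makes the logarithmic growth of the relevant radius explicit.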


If $r$ is larger than the diameter of the graph, then clearly we can identify from the neighborhoods.
Thus we can use known results on the diameter of the Erdos-Renyi random graph (see [Riordan
and Wormald, 2010], [Luczak, 1998], [Nachmias and Peres, 2008], [Addario-Berry et al., 2012]) to
get a lower bound on the growth of $r$ to guarantee identifiability. Denote convergence in probability
by $\xrightarrow{p}$.


Theorem 4.2. Let $\mathcal{G}_N$ be the Erdos-Renyi random graph on $N$ vertices with edge probability $p_N = \lambda/N$ for a fixed $\lambda > 0$ and let $D = D_{N,\lambda}$ be the maximum diameter of a component of $\mathcal{G}_N$.

• [Luczak, 1998, Theorem 11] If $\lambda < 1$, then $D_{N,\lambda}/\log(N) \xrightarrow{p} 1/\log(1/\lambda)$.

• [Nachmias and Peres, 2008, Theorem 1.1], [Addario-Berry et al., 2012, Theorem 5] If $\lambda = 1$,
then $N^{-1/3}D_{N,1}$ converges in distribution to a non-negative and non-degenerate distribution.

• [Riordan and Wormald, 2010, Theorem 1.1] If $\lambda > 1$, and $\lambda_* < 1$ is the unique solution to
$\lambda e^{-\lambda} = \lambda_* e^{-\lambda_*}$, then

$$\frac{D_{N,\lambda}}{\log(N)} \xrightarrow{p} \frac{1}{\log(\lambda)} + \frac{2}{\log(1/\lambda_*)}.$$
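The constants in the subcritical and supercritical cases are easy to evaluate numerically. The sketch below assumes the limits are $1/\log(1/\lambda)$ for $\lambda < 1$ and $1/\log(\lambda) + 2/\log(1/\lambda_*)$ for $\lambda > 1$, and finds the conjugate parameter $\lambda_*$ by bisection; the function names and the tolerance are illustrative choices.

```python
import math

def conjugate_parameter(lam: float, tol: float = 1e-14) -> float:
    """For lam > 1, solve x * exp(-x) = lam * exp(-lam) for x in (0, 1) by bisection."""
    if lam <= 1:
        raise ValueError("the conjugate parameter is defined here only for lam > 1")
    target = lam * math.exp(-lam)
    lo, hi = 0.0, 1.0  # x * exp(-x) is increasing on (0, 1), from 0 up to 1/e
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid * math.exp(-mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def diameter_constant(lam: float) -> float:
    """Limit in probability of D / log(N) for lam != 1."""
    if lam < 1:
        return 1.0 / math.log(1.0 / lam)
    if lam > 1:
        lam_star = conjugate_parameter(lam)
        return 1.0 / math.log(lam) + 2.0 / math.log(1.0 / lam_star)
    raise ValueError("at lam = 1 the diameter scales like N**(1/3), not log(N)")
```

Any radius r exceeding `diameter_constant(lam) * log(N)` by a constant factor then exceeds the diameter with probability tending to one, which is the sufficient condition for identifiability used in this section.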


The corresponding theorem in the introduction summarizes the lower bound on the neighborhood size for
identifiability given by Proposition 4.1 and the upper bounds given by the properties of the diameter
of Theorem 4.2.


Labeled Erdos-Renyi. Assuming vertices have i.i.d. labels from a finite set and we let $\mathcal{P}_2$ be
the chance that two given vertices have the same label, we show the following result.

Proposition 4.3. For the labeled Erdos-Renyi graph with $p_N = \lambda/N$, using the notation of the
previous paragraph, and assuming $\mathcal{P}_2 < 1$, if for some $\varepsilon > 0$,

r 1 

log(A^)-21og(l-iP2) ^ 2A-21og(A)-log(iP2) " 

then the chance of identifiability tends to zero as $N \to \infty$.

Proof. The argument is nearly identical to the proof of Proposition 4.1 but now the blocking
configuration is two isolated line graphs with $2r + 1$ vertices, both having the same labels in the
$2r - 1$ middle vertices, and each having two different labels at the endpoints; switching labels of an
appropriately chosen endpoint (being careful of symmetries) from each line graph results in a
non-isomorphic labeled graph with the same neighborhoods. If $B$ is the number of such configurations,
then the result follows from the second moment method after computing

^ (iV - 4r - 2)^^+X'(l - - V 2 f ^ 

ER2 - 8-g(Ar_4r-2)4^+2p4):(l-p^)(^^-2r-4)(4r-2)p2r-l('^_p^^2- 

We make the following conjecture. 

Conjecture 4.4. Consider a distribution $\pi$ that is fully supported on $\{1, \ldots, q\}$ and the i.i.d. $\pi$-vertex
labeling of the Erdos-Renyi random graph on $N$ vertices with parameter $\lambda/N$. For positive
$\lambda \ne 1$, there exists a constant $c_\lambda(\pi)$ such that for every $\varepsilon > 0$, when $r > (1 + \varepsilon)c_\lambda(\pi) \log N$,
the probability of identifiability tends to one as $N \to \infty$, while when $r < (1 - \varepsilon)c_\lambda(\pi) \log N$, the
probability of identifiability tends to 0.

Open problems are to establish the conjecture, determine the value of $c_\lambda(\pi)$, and understand
the critical case where $\lambda = 1$.


4.2 Dense Erdos-Renyi graph 

Now we assume that $\mathcal{G}$ is the Erdos-Renyi random graph with $N$ vertices and edge probability
$p_N$ such that as $N \to \infty$, $Np_N/\log(N)^2 \to \infty$ and the neighborhoods are as before, described in
Example 2. We restate and prove Theorem 1.6 from the introduction.

Theorem 4.5. If $\mathcal{G}$ is the Erdos-Renyi random graph with $N$ vertices and edge probability $p_N$
satisfying $Np_N/\log(N)^2 \to \infty$ as $N \to \infty$ and we are given $\mathcal{N}_3(v)$ for each vertex $v$ in $\mathcal{G}$, then the
probability of identifiability tends to one.
Proof. If $p_N > N^{-3/5}$, then the diameter of $\mathcal{G}$ is at most 3 [Bollobas, 1981] and so we can assume
without loss that $p_N \le N^{-3/5}$.

We show that the chance of the event “each vertex $v$ has distinct 2-neighborhood $\mathcal{N}_2(v)$” tends
to one and then the result follows by the uniqueness of overlaps Lemma 2.3. If $v$ and $w$ are distinct
vertices of $\mathcal{G}$, then it’s enough to show as $N \to \infty$,

$$N^2\,\mathbb{P}(\mathcal{N}_2(v) = \mathcal{N}_2(w)) \to 0. \qquad (4.3)$$

In order for $\mathcal{N}_2(v) = \mathcal{N}_2(w)$, the degree of $v$ ($\deg(v)$) must be equal to that of $w$ and the degrees
of the neighbors of $v$ and $w$ must be equal as multi-sets. Note that we can write $\deg(v) = B_v + I$
and $\deg(w) = B_w + I$ where $B_v$ and $B_w$ are independent with distribution $\mathrm{Bi}(N - 2, p_N)$ and $I$ is
the indicator that $v$ and $w$ have an edge between them. We bound the chance that $v$ and $w$ have
the same degree and the chance of sharing too many neighbors as follows.

1. The Chernoff bound of Lemma 4.6 applied to the binomial distribution implies that for all
$0 < \varepsilon_1 < 1/2$,

P (deg(u) G Np]y{l ± ei)) > 1 — 2 exp | —^A^pArj . 

2. Noting that the event $\deg(v) = \deg(w)$ is independent of $I$, the indicator that $v$ and $w$ have
an edge between them, we use the local limit theorem for the binomial distribution to find
for $C$ not depending on $N$,

$$\mathbb{P}(\deg(w) = \deg(v) \mid \deg(v) \in Np_N(1 \pm \varepsilon_1)) \le \frac{C}{\sqrt{Np_N}}.$$

3. Let $M := \deg(v)\,\mathbb{I}[\deg(w) = \deg(v) \in Np_N(1 \pm \varepsilon_1)]$ be the common degree of $v$ and $w$
assuming the conditioning of the items above hold (and zero otherwise), and let $K =
M - |\mathcal{N}_1(v) \cap \mathcal{N}_1(w)| + I$ be the number of neighbors of $v$ and $w$ that are connected to
exactly one of $v$ or $w$. Given $M > 0$, the neighbors of $v$ and $w$ are each chosen uniformly
from the $N - 1$ possible neighbors. Thus if $v$ and $w$ are not neighbors, then $M - K$ is
hypergeometric with $M$ draws, $M$ marked balls and $N - 2$ total balls and if $v$ and $w$ are
neighbors, then $M - K$ is hypergeometric with $M - 1$ draws, $M - 1$ marked balls and $N - 2$
total balls. In either case, after noting that hypergeometric distributions can be represented
as sums of independent indicators [Pitman, 1997], the Chernoff bound of Lemma 4.6 implies
that for $0 < \varepsilon_2 < 1/2$,

F {M - K e N-^{1 ± e 2 )\M > 0) > 1 - 2exp |-^iVp 7 v| . 

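Note that if $M > 0$, then $M \in Np_N(1 \pm \varepsilon_1)$ and so the event $M - K \in \frac{M^2}{N}(1 \pm \varepsilon_2)$
implies $M - K \in Mp_N(1 \pm \varepsilon_1 \pm \varepsilon_2)$ and that

$$K \ge M(1 - p_N(1 + \varepsilon_1 + \varepsilon_2)) \ge Np_N(1 - 2p_N)/2,$$

$$K \le M(1 - p_N(1 - \varepsilon_1 - \varepsilon_2)) \le 2Np_N.$$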


Given $M > 0$ and $K$, let $\{D(v)\} := \{D_1(v), \ldots, D_K(v)\}$ and $\{D(w)\} := \{D_1(w), \ldots, D_K(w)\}$
denote the multi-set of degrees of the $K$ non-intersecting neighbors of $v$ and $w$, respectively. The
three items above imply the following bound.


P(A/ 2 (u) =M 2 {w)) < 2exp |-^A^pAr| 


-7 


2C exp | —^A^pAr| 


VNpN 


(4.4) 
c

V^PN 




{D{w)}\M > 0,Np]\f{l — 2pAr)/2 < K < 2Npn] . 


(4.5) 


Since $p_N > \log(N)^2/N$, the first two terms of (4.4) are easily seen to be $o(1/N^2)$ so we only need
to bound (4.5).
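The necessary condition driving this part of the proof (two vertices with equal 2-neighborhoods must have equal degrees and equal multi-sets of neighbor degrees) can be explored empirically. The sketch below is an illustration under assumed names: `sample_er` draws an Erdos-Renyi graph with a seeded generator, `degree_signature` records the degree of a vertex together with the sorted degrees of its neighbors, and `colliding_pairs` counts pairs of vertices whose signatures coincide. Distinct signatures imply distinct 2-neighborhoods, but not conversely.

```python
import random
from collections import Counter

def sample_er(n, p, seed=0):
    """Adjacency dict of an Erdos-Renyi graph on vertices 0..n-1."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def degree_signature(adj, v):
    """Pair (deg(v), sorted tuple of the degrees of the neighbors of v)."""
    return (len(adj[v]), tuple(sorted(len(adj[u]) for u in adj[v])))

def colliding_pairs(adj):
    """Number of unordered vertex pairs that share a degree signature."""
    counts = Counter(degree_signature(adj, v) for v in adj)
    return sum(c * (c - 1) // 2 for c in counts.values())
```

Running `colliding_pairs(sample_er(n, p))` for growing n with n * p well above log(n)**2 gives a quick sanity check of the claim that collisions become rare in the dense regime.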

Write $D_i(v) = V_i + A_i$, where $V_i$ is the number of edges between the $i$th of these neighbors of $v$ and the $N - 2K - 1$ vertices not
in $(\mathcal{N}_1(v) \cup \mathcal{N}_1(w)) \setminus (\mathcal{N}_1(v) \cap \mathcal{N}_1(w))$ and $A_i$ the number of edges between it and the remaining $2K - 1$
potential neighbors. Similarly, write $D_i(w) = W_i + B_i$. Note that given $M > 0$ and $K$, $\{V_1, \ldots, V_K\}$
and $\{W_1, \ldots, W_K\}$ are two independent (unordered) collections of i.i.d. $\mathrm{Bi}(N - 2K - 1, p_N)$ random
variables that are also independent of the $A_i$’s and $B_i$’s.

We show (i) that with good probability Ai and Bi are bounded by a constant and (ii) that the 
chance that independent binomial multi-sets are within constants is small. 

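For (i), the $A_i$’s and $B_i$’s form the degrees of an Erdos-Renyi graph on $2K$ vertices with edge
parameter $p_N$ and so are each marginally distributed $\mathrm{Bi}(2K - 1, p_N)$. Thus,

$$\mathbb{P}(A_i \le x, B_i \le x, i = 1, \ldots, K) \ge 1 - 2K\,\mathbb{P}(A_1 > x) \ge 1 - 2K\left(\frac{e}{x}\right)^x \bigl((2K - 1)p_N\bigr)^x,$$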


where we have used standard tail bounds on the binomial distribution in the Poisson regime stated
in Lemma 4.6. Note that setting $x = 13$ (any $x > 12$ works) and using that $p_N \le N^{-3/5}$, we find
that if $K \le 2Np_N$, then

$$\mathbb{P}(\max_i \{A_i, B_i\} > 13) \le o(N^{-2}). \qquad (4.6)$$

Assume here and below that $N$ is large enough so that $1 - 2p_N > 2/3$. At this point we only
need to show that for $\{V_1, \ldots, V_K\}$ and $\{W_1, \ldots, W_K\}$ two independent (unordered) collections of
i.i.d. $\mathrm{Bi}(N - 2K - 1, p_N)$ random variables with $Np_N/3 < Np_N(1 - 2p_N)/2 \le K \le 2Np_N$, and for
fixed non-negative $A_1, \ldots, A_K, B_1, \ldots, B_K$ such that each $A_i$ and $B_i$ are no greater than 13,

$$\mathbb{P}(\{V_1 + A_1, \ldots, V_K + A_K\} = \{W_1 + B_1, \ldots, W_K + B_K\}) = o(1/N^2). \qquad (4.7)$$


Rather than dealing with the multi-sets, we look instead at the (nearly multinomial) vectors of
counts. For $i = -13, \ldots, N - 2K + 12$, let $X_i = |\{j : V_j + A_j = i\}|$ be the number of the $(V_j + A_j)$’s
that are equal to $i$ and $Y_i = |\{j : W_j + B_j = i\}|$ be the analogous counts for the $(W_j + B_j)$’s. The
left hand side of (4.7) is equal to

$$\mathbb{P}(X_j = Y_j, j = -13, \ldots, N - 2K + 12) \le \mathbb{P}\bigl(X_{j_i} = Y_{j_i}, i = 0, \ldots, \lfloor \alpha\sqrt{Np_N} \rfloor - 1\bigr), \qquad (4.8)$$

where $\alpha > 0$ will be chosen later and we define $j_i = \lfloor Np_N \rfloor + i$. To shorten formulas define the
index set $\mathcal{I} = \mathcal{I}(N, \alpha) := \{0, \ldots, \lfloor \alpha\sqrt{Np_N} \rfloor - 1\}$. We bound the probability (4.8) by showing first
that for an appropriate $\delta > 0$,

$$\mathbb{P}(X_{j_i} > (1 + \delta)\mathbb{E}X_{j_i}, \text{ for some } i \in \mathcal{I}) = o(N^{-2}), \qquad (4.9)$$

and then that given $X_{j_i} \le (1 + \delta)\mathbb{E}X_{j_i}$ for all $i \in \mathcal{I}$, we apply the local central limit theorem to the
$Y_{j_i}$ (represented as sums of independent Bernoulli variables) to show that the event on the right
hand side of (4.8) has chance $o(N^{-2})$.

To show (4.9), first note that by the local central limit theorem for the binomial distribution
(noting that $p_N \to 0$), there are positive constants $c_1 = c_1(\alpha)$ and $c_2$ such that for all $i \in \mathcal{I}$ and
$k = 1, \ldots, K$,

$$\frac{c_1}{\sqrt{Np_N}} \le \mathbb{P}(V_k + A_k = j_i),\ \mathbb{P}(W_k + B_k = j_i) \le \frac{c_2}{\sqrt{Np_N}}. \qquad (4.10)$$

Thus, for each $j_i$, $X_{j_i}$ is a sum of $K$ independent Bernoulli variables, each having success probability
upper and lower bounded as per (4.10), and using this, a union bound, Lemma 4.6, and the bounds
on $K$ and $p_N$, we have


P(Xj. > (1 + 6)KXj^, for some i £ I) 



< 2a^/Nj^exp | y V| 

< 2aN^/^N~^^o{l). 


Now choosing

$$\delta = \frac{1 + \sqrt{1 + 40c_1/27}}{10c_1/27}$$

shows (4.9) is satisfied, since for this choice of $\delta$,


Cl 

2 + (5'3 


+ 1/5 = -2. 


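To finish the proof, we show that for an appropriate choice of $\alpha$ (small),

$$\mathbb{P}(X_{j_i} = Y_{j_i}, i \in \mathcal{I} \mid X_{j_i} \le (1 + \delta)\mathbb{E}X_{j_i}, \text{ all } i \in \mathcal{I}) = o(N^{-2}).$$

Let $K_0 = K$ and $K_i = K - \sum_{\ell=0}^{i-1} X_{j_\ell}$ and define $\mathcal{F}_i$ to be the sigma field generated by $Y_{j_0}, \ldots, Y_{j_i}$.
Observe that for each $i \in \mathcal{I}$, given $\mathcal{F}_{i-1}$, $Y_{j_i}$ is a sum of $K_i$ Bernoulli variables, each having success
probability $q_i$ satisfying (using (4.10))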


i/Njm 1 - ici/i/NpN l-iC 2 /y/Nm V^PN 

So we demand that $(1 - \alpha c_2) > 0$ which is not an issue: changing $\alpha$ affects only $c_1$ and $\delta$ in the
argument above. Moreover, by decreasing $\alpha$, we increase $c_1$, and as $\alpha \to 0$, $c_1$ stays bounded
from above (since it’s no greater than $c_2$) and thus so does $\delta$. The local central limit theorem for sums of
independent Bernoulli variables implies that


P (Tj, = Xj^jXj^ < (1 + 6)EXj^ all i E X; Yj^ = Xj^ all £ = 0,... ,i - 1; X)_i) 

,n-i/2 (4.11) 


< C 


Ki 


Cl 


1 - 


C2 


V^PN V V^PN 


:(1 — q ;C 2 ) 


-1 


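for some constant $C$. Now the condition that $X_{j_i} \le (1 + \delta)\mathbb{E}X_{j_i}$ and the lower bound on $K$ imply
that

$$K_i \ge \frac{Np_N}{3} - \sum_{\ell=0}^{i-1} X_{j_\ell} \ge \frac{Np_N}{3} - (1 + \delta)\sum_{\ell=0}^{i-1} \mathbb{E}X_{j_\ell} \ge \frac{Np_N}{3} - (1 + \delta)\alpha 2c_2 Np_N,$$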
where we have used that $\mathbb{E}X_{j_i} \le Kc_2/\sqrt{Np_N} \le 2c_2\sqrt{Np_N}$. By choosing $\alpha$ small enough (so that
$1/3 - 2c_2(1 + \delta)\alpha > 0$) we find that $K_i$ is at least of order $Np_N$ for all $i \in \mathcal{I}$ so that (4.11) is
$O((Np_N)^{-1/4})$. Now moving through $\mathcal{I}$ sequentially, we have

$$\mathbb{P}(X_{j_i} = Y_{j_i}, i \in \mathcal{I} \mid X_{j_i} \le (1 + \delta)\mathbb{E}X_{j_i}, \text{ all } i \in \mathcal{I})
\le \exp\left\{-\frac{\alpha}{4}\sqrt{Np_N}\,(\log(Np_N) + O(1))\right\}$$

$$\le \exp\left\{-\frac{\alpha}{2}\log(N)\,(\log(\log(N)) + O(1))\right\} = o(N^{-2}).$$

□ 


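Lemma 4.6. Let $X$ be the sum of independent indicators. Then for any $\varepsilon > 0$,

$$\mathbb{P}(X \le \mathbb{E}X(1 - \varepsilon)) \le \exp\left\{-\frac{\varepsilon^2}{2}\mathbb{E}X\right\},$$

$$\mathbb{P}(X \ge \mathbb{E}X(1 + \varepsilon)) \le \exp\left\{-\frac{\varepsilon^2}{2 + \varepsilon}\mathbb{E}X\right\}.$$

If $X$ is binomially distributed and $x > 0$, then

$$\mathbb{P}(X \ge x) \le \left(\frac{e\,\mathbb{E}X}{x}\right)^x.$$

Proof. The first statement is a standard Chernoff bound for sums of independent indicators. The
second follows in the usual way but we prove this particular form. For any $\theta > 0$, a direct
computation yields

$$\mathbb{P}(X \ge x) \le e^{-\theta x}\mathbb{E}e^{\theta X} \le \exp\{\mathbb{E}X(e^{\theta} - 1) - \theta x\}.$$

Setting $\theta = \log(1 + x/\mathbb{E}X)$ in the previous formula and simplifying yields

$$\mathbb{P}(X \ge x) \le \exp\{x - x\log(1 + x/\mathbb{E}X)\} \le \left(\frac{e\,\mathbb{E}X}{x}\right)^x,$$

as desired. □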
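The binomial tail bound can be compared against exact tail probabilities. The sketch below assumes the bound has the form $(e\,\mathbb{E}X/x)^x$ and that $x$ is a positive integer; the function names are illustrative.

```python
import math

def binomial_upper_tail(n, p, x):
    """Exact P(X >= x) for X ~ Bi(n, p) and a non-negative integer x."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def poisson_regime_bound(n, p, x):
    """The bound (e * EX / x) ** x with EX = n * p."""
    return (math.e * n * p / x) ** x
```

The bound is only informative once x exceeds e times the mean, which is the regime in which it is applied in the proof of Theorem 4.5 (small mean, fixed threshold x = 13).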


We finish the section with a couple of open problems. In Theorem 4.5, is it possible to identify
from 2-neighborhoods? What happens in the regime of $p_N$ we don’t handle, where $p_N =
O(\log(N)^2/N)$?


5 The Random Jigsaw Puzzle 

Consider a factory that manufactures jigsaw puzzles - with the goal of producing individual unique 
puzzles that can be assembled. Since the images on the puzzle might not be informative (e.g. if 
there is a large patch of sky), the factory aims to make sure that a unique assembly of the puzzle 
is guaranteed just from the shape of the interface of the pieces. Assume that there are q different
types of interfaces which we call “jigs” and the puzzle is of size n × n. How large should q be so
that the puzzle can be uniquely assembled? Note that intuitively assembly of the puzzle is harder 
the smaller q is. In this section we provide upper and lower bounds on q in terms of n to determine 
identifiability. The scaling between q and n is stated as an open problem. 
5.1 Formal description of the model

We use a more intuitive description than that of Example 3 in the introduction. The puzzle is
given by an n × n grid of squares where adjacent squares share an edge. Each edge of a square
in the grid is colored uniformly at random from one of q colors. A piece of the puzzle consists of
a “vertex” at the center of the square along with the four adjacent colored edges. Vertices at the
edge of the grid have an edge on the border of the grid so that each vertex has exactly four edges
associated to it. Given two pieces both having an edge of the same color, we assume that there is
a unique way to connect the two pieces (i.e. there are no symmetries in the jigs). The input to the
problem is all pieces and the desired output is the original composition of the puzzle.

We first use blocking configurations to obtain an easy negative result.

Proposition 5.1. If $q = o(n^{2/3})$ then the probability of identification goes to 0 as $n \to \infty$.

Proof. Call a pair of pieces aligned if it is at position $(j, 2i), (j, 2i + 1)$. Consider the map
$\pi : (x, y) \mapsto (x - j + j', y - 2i + 2i')$ and let $X_{i,j,i',j'}$ be the indicator of the following event:
all edges emanating from $(j, 2i), (j, 2i + 1)$ have the same color as their $\pi$ images except that
the edge connecting $(j, 2i)$ and $(j, 2i + 1)$ has a different color than its image under $\pi$. Note that
if $X_{i,j,i',j'} = 1$ then there isn’t a unique solution to the puzzle as the two aligned parts can be
exchanged. Note that here we use the fact that with high probability there are no automorphisms of
the labelled puzzle (even excluding two neighborhoods). Let

$$Y = \sum_{(i,j) \ne (i',j')} X_{i,j,i',j'}.$$

Then $\mathbb{E}X_{i,j,i',j'} = q^{-6}(1 - 1/q)$ and moreover, it is easy to check that the $X_{i,j,i',j'}$ are pairwise
independent. Thus

$$\mathrm{Var}(Y) = \sum_{(i,j) \ne (i',j')} \mathrm{Var}(X_{i,j,i',j'}) \le \mathbb{E}[Y].$$

It follows that if $n^4 q^{-6} \to \infty$ then $\mathbb{E}[Y] \to \infty$ and so by the second moment method, $\mathbb{P}[Y \ge 1] \to 1$,
concluding the proof. □
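The scaling $n^4 q^{-6}$ in this proof can be made concrete with a rough count. The sketch below assumes 0-based coordinates, counts ordered pairs of distinct aligned positions, and ignores boundary effects; these conventions and the function names are illustrative assumptions, so the output should be read as an order-of-magnitude estimate of the expected number of exchangeable aligned pairs.

```python
def aligned_pair_count(n):
    """Number of aligned positions (j, 2i), (j, 2i + 1) in an n x n grid, 0-based."""
    return n * (n // 2)

def expected_exchangeable(n, q):
    """Rough expected number of exchangeable configurations: pairs * q**-6 * (1 - 1/q)."""
    pairs = aligned_pair_count(n)
    ordered_pairs = pairs * (pairs - 1)
    return ordered_pairs * q**-6 * (1 - 1 / q)
```

Since the number of ordered pairs grows like $n^4/4$, the estimate diverges exactly when $n^4 q^{-6} \to \infty$, that is, when $q = o(n^{2/3})$.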

On the other hand, if $q \gg n^4$, then by considering expectations, the number of pairs of edges with
the same color tends to zero in probability and identification is trivial, so if $q = \omega(n^4)$, then the
probability of identification tends to 1 as $n \to \infty$. In fact we can do better.

Proposition 5.2. If $q = \omega(n^2)$ then it is possible to assemble the puzzle with probability tending to
one. More formally, if $q = \omega(n^2)$ then there exists an algorithm such that the probability it correctly
assembles the puzzle (up to rotations) tends to one.

Proof. We show that with probability tending to one, we can assemble the puzzle by first joining
edges with colors that appear exactly once in the puzzle and then filling in any remaining holes.
Write $q = 2cn(n + 1)$ and let $m = 2n(n + 1)$ be the number of edges. Let $U$ be the number of colors
which appear exactly once. Then

$$\mathbb{E}U = q\,\frac{m}{q}(1 - 1/q)^{m-1} \ge m(1 - 1/c).$$

Also note that $U$ is a function of the independent edge colors such that if a single color changes,
then $U$ can change by at most 2. Thus we can apply McDiarmid’s inequality for bounded differences
to obtain that

$$\mathbb{P}(U \ge m(1 - 2/c)) \ge 1 - \exp\{-m/(2c^2)\}.$$
Given $U$, the locations of the edges that receive unique colors are exchangeable and so on $U \ge
m(1 - 2/c)$, $U$ dominates the Bernoulli-$(1 - 3/c)$ product measure on edges with chance at least
$1 - \exp\{-m(1 - 2/c)^2/(3 - 1/c)\}$, using, e.g., the Chernoff bound of Lemma 4.6. Thus, on the good
event that $U \ge m(1 - 2/c)$ and at most $m(1 - 2/c)$ of the Bernoulli variables are 1, we can generate
the locations of the unique colored edges by first generating the Bernoulli variables on edges and
then adding the appropriate number of unique colors to the remaining edges chosen uniformly at
random.
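The first-moment computation for the number of uniquely colored edges is simple to reproduce. The sketch below assumes m edges colored independently and uniformly from q colors; `expected_unique_colors` evaluates $m(1 - 1/q)^{m-1}$ exactly, and `count_unique_colors` draws one seeded sample for comparison. Both names are illustrative.

```python
import random
from collections import Counter

def expected_unique_colors(m, q):
    """Expected number of colors used exactly once among m uniform draws from q colors."""
    return m * (1 - 1 / q) ** (m - 1)

def count_unique_colors(m, q, seed=0):
    """One simulated value of U: colors appearing exactly once among m edges."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(q) for _ in range(m))
    return sum(1 for c in counts.values() if c == 1)
```

With q = c * m, Bernoulli's inequality gives `expected_unique_colors(m, q) >= m * (1 - 1 / c)`, which is the lower bound used in the proof.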

If c is large enough so that 1 − 3/c > 0.9 (say), then standard results in percolation theory
[Grimmett, 1999, (8.97-8)] imply that the graph induced by the positive Bernoulli variables in the
box (which on the good event are dominated by the unique edge color indicators) has a connected
component touching all boundaries. Once such a component is determined (up to rotations), it is
not hard to complete the puzzle. By considering expectations, the probability of having two pieces
that share two or more colors tends to zero. Thus given a location of a piece neighboring two pieces
that are already assembled - i.e., an empty corner - there is a unique piece that can fit there.

Consider the process of starting with the component formed by joining edges with unique colors
and then repeatedly adding pieces to vacant corners. With probability tending to one, when this
process terminates, the collection of vertices covered has no empty corners. It is easy to see that
this implies that the complete puzzle has been recovered. □
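The first stage of the assembly procedure, joining pieces across edges whose color appears exactly once, can be sketched with a union-find structure. This is an illustration under assumptions that are not in the text: pieces are indexed by grid cells, interior edge colors are passed in the order produced by the hypothetical helper `grid_edges`, and border edge colors are passed separately because they touch only one piece. The corner-filling stage is not implemented.

```python
from collections import Counter

def grid_edges(n):
    """Interior edges of an n x n grid of pieces, as pairs of neighboring cells."""
    edges = []
    for x in range(n):
        for y in range(n):
            if x + 1 < n:
                edges.append(((x, y), (x + 1, y)))
            if y + 1 < n:
                edges.append(((x, y), (x, y + 1)))
    return edges

def largest_unique_component(n, interior_colors, border_colors=()):
    """Size of the largest set of pieces joined through uniquely colored edges."""
    edges = grid_edges(n)
    if len(interior_colors) != len(edges):
        raise ValueError("expected one color per interior edge")
    # A color is unique if it appears exactly once among all edges, border included.
    counts = Counter(interior_colors) + Counter(border_colors)
    parent = {(x, y): (x, y) for x in range(n) for y in range(n)}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (a, b), color in zip(edges, interior_colors):
        if counts[color] == 1:
            parent[find(a)] = find(b)
    sizes = Counter(find(v) for v in parent)
    return max(sizes.values())
```

When the returned component spans the box, the remaining pieces can be placed one empty corner at a time as described in the proof.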

Remark 5.3. We have assumed that “edge” pieces of the puzzle cannot be distinguished from 
interior pieces. If the edge pieces can be distinguished, then the proposition still holds since with 
probability tending to one it is possible to construct the border by matching colors that only appear 
once on the border and then filling in the interior using corners as is done in the proof above. It’s 
interesting that without the border, we need a non-trivial result from percolation theory to start 
the algorithm. 

6 Conclusion and Additional Open Problems 

A number of open problems regarding sharper bounds and extension to other models are mentioned 
in the text and can be summarized as follows: 

Problem 6.1. For the graph shotgun problem on boxes in $\mathbb{Z}^d$ with labels given by i.i.d., Ising, Potts
model, proper coloring, etc., find the threshold for the graph identification problem.

It is natural to consider canonical fixed graphs other than the lattice. As illustrated in the 
introduction, the case of regular trees should be rather straightforward for many of these models. 
However, other families of graphs may be amenable to analysis, e.g., expander graphs. 

Problem 6.2. For the graph shotgun problem on a random graph model, e.g., Erdos-Renyi,
preferential attachment, configuration, random regular graphs, etc., find the threshold for the graph
identification problem.

This question applies to both the labeled and unlabeled case. It is also interesting to understand 
if the graph identification problem shares properties of other constraint satisfaction problems: 

Problem 6.3. Are there graph shotgun problems for which there is a “computationally hard” but
identifiable regime?

This problem identifies graph shotgun assembly as a constraint satisfaction problem: for each
neighborhood we have to find all intersecting neighborhoods. In the language of constraint
satisfaction, the problem would be classified as planted, meaning that we start from a solution and then
impose constraints based on the solution.